
Conversation

@mattt (Owner) commented Feb 9, 2026

Follow-up to #106


Copilot AI left a comment

Pull request overview

Updates the on-device structured JSON generation path to use XGrammar-based token masking, replacing the prior schema-constrained generator flow in the MLX and Llama model implementations.

Changes:

  • Add swift-xgrammar dependency and integrate XGrammar into the package target.
  • Implement XGrammar-driven structured JSON generation loops for MLXLanguageModel and LlamaLanguageModel.
  • Add new structured-generation error cases (schemaEncodingFailed, grammarMismatch) in both models.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.

  • Sources/AnyLanguageModel/Models/MLXLanguageModel.swift: Switch structured JSON generation to XGrammar matcher + token-bitmask sampling.
  • Sources/AnyLanguageModel/Models/LlamaLanguageModel.swift: Switch structured JSON generation to XGrammar matcher + token-bitmask sampling; make generation async.
  • Package.swift: Add swift-xgrammar dependency and link the XGrammar product to the main target.
  • Package.resolved: Update resolved pins to include swift-xgrammar (and other pin changes).

Comment on lines +1039 to +1045
let token = try backend.sample(using: bitmask, applyMask: needsMask)
if backend.endTokens.contains(token) {
    break
}
guard matcher.accept(Int32(token)) else {
    throw LlamaLanguageModelError.grammarMismatch
}

Copilot AI Feb 9, 2026

The generation loop breaks immediately when a sampled token is in endTokens, before checking/advancing the grammar matcher. If the model samples EOS/EOT before the grammar has fully terminated (or in a step where masking is not applied), this will return incomplete/invalid JSON. Consider only stopping on end tokens when matcher.isTerminated is already true (or accept the token into the matcher and verify termination) so early EOS doesn’t truncate the structured output.
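A minimal sketch of that ordering, assuming a surrounding loop and a `generatedTokens` buffer that are not shown in the quoted hunk; `matcher.isTerminated` is the property referenced above, and the bitmask is assumed to be refreshed from the matcher each step:

while generatedTokens.count < maxTokens {
    let token = try backend.sample(using: bitmask, applyMask: needsMask)

    if backend.endTokens.contains(token) {
        // Only honor EOS/EOT once the grammar has terminated; an early end
        // token would otherwise truncate the structured output.
        guard matcher.isTerminated else {
            throw LlamaLanguageModelError.grammarMismatch
        }
        break
    }

    guard matcher.accept(Int32(token)) else {
        throw LlamaLanguageModelError.grammarMismatch
    }
    generatedTokens.append(token)

    if matcher.isTerminated {
        break
    }
}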

Comment on lines +927 to +950
private func tokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    guard vocabSize > 0 else {
        throw LlamaLanguageModelError.contextInitializationFailed
    }

    var encodedVocab: [String] = []
    encodedVocab.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize {
        let token = llama_token(tokenId)
        encodedVocab.append(tokenToText(vocab: vocab, token: token) ?? "")
    }

    let stopTokenIDs = stopTokens.map { Int32($0) }
    return try TokenizerInfo(
        encodedVocab: encodedVocab,
        encoding: .byteFallback,
        stopTokenIDs: stopTokenIDs,
        addPrefixSpace: false
    )
}

Copilot AI Feb 9, 2026

Building TokenizerInfo reconstructs the entire encodedVocab (size = vocabSize) for every structured-generation request. This is expensive and deterministic for a given model/vocab; consider caching the computed TokenizerInfo (keyed by vocab pointer + vocabSize, or by model ID) to avoid repeatedly traversing the entire vocabulary.
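One possible shape for such a cache, as a sketch; the key type, storage property, and wrapper function below are hypothetical and not part of this PR:

private struct TokenizerInfoCacheKey: Hashable {
    let vocab: OpaquePointer
    let vocabSize: Int
}

// Cached per model/vocab; stop tokens are assumed stable for a given model.
private var tokenizerInfoCache: [TokenizerInfoCacheKey: TokenizerInfo] = [:]

private func cachedTokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    let key = TokenizerInfoCacheKey(vocab: vocab, vocabSize: vocabSize)
    if let cached = tokenizerInfoCache[key] {
        return cached
    }
    let info = try tokenizerInfo(for: vocab, vocabSize: vocabSize, stopTokens: stopTokens)
    tokenizerInfoCache[key] = info
    return info
}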

Comment on lines +49 to 50
.product(name: "XGrammar", package: "swift-xgrammar"),
.product(

Copilot AI Feb 9, 2026

XGrammar is only imported/used inside the MLX and Llama conditional compilation blocks, but the target depends on it unconditionally. This forces consumers to fetch/build swift-xgrammar even when neither trait is enabled, and can break traitless builds if that dependency has platform/toolchain constraints. Consider making the XGrammar product dependency conditional on the MLX and/or Llama traits (similar to MLXLLM / LlamaSwift).

Suggested change

-    .product(name: "XGrammar", package: "swift-xgrammar"),
-    .product(
+    .product(
+        name: "XGrammar",
+        package: "swift-xgrammar",
+        condition: .when(traits: ["MLX", "Llama"])
+    ),
+    .product(

Comment on lines +864 to +882
let backend = try MLXTokenBackend(
    context: context,
    input: lmInput,
    parameters: generateParameters,
    maximumTokens: maxTokens,
    endTokens: []
)

var generator = try ConstrainedJSONGenerator(backend: backend, schema: schema)
let json = try generator.generate()
let jsonSchema = try jsonSchemaString(for: schema)
let grammar = Grammar(jsonSchema: jsonSchema, formatting: .compact, strictMode: true)
let tokenizerInfo = try tokenizerInfo(
    for: context.tokenizer,
    vocabSize: backend.vocabSize,
    stopTokens: backend.endTokens
)
let matcher = try await grammar.matcher(
    for: tokenizerInfo,
    stopTokens: backend.endTokens.map { Int32($0) },
    terminatesWithoutStopToken: true
)

Copilot AI Feb 9, 2026

Structured generation constructs MLXTokenBackend with endTokens: [], which disables the backend’s normal EOS/EOT end-token detection. With the new XGrammar path, stopTokens and the loop’s early-stop check both rely on backend.endTokens, so this ends up passing an empty stop-token set to the matcher and prevents clean termination via EOS. Consider passing endTokens: nil here (use the backend’s default end tokens) and plumbing those into the matcher / termination logic so generation can stop cleanly when the model emits EOS after the grammar is satisfied.
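A sketch of what that could look like; whether `MLXTokenBackend` treats `nil` as "use the model's default end tokens" is an assumption about this PR's API:

let backend = try MLXTokenBackend(
    context: context,
    input: lmInput,
    parameters: generateParameters,
    maximumTokens: maxTokens,
    endTokens: nil  // fall back to the backend's default EOS/EOT tokens
)

// The matcher then receives a non-empty stop-token set, so generation can
// terminate cleanly when the model emits EOS after the grammar is satisfied.
let matcher = try await grammar.matcher(
    for: tokenizerInfo,
    stopTokens: backend.endTokens.map { Int32($0) },
    terminatesWithoutStopToken: true
)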

Comment on lines +1118 to +1132
if applyMask {
    var allowedIndices: [UInt32] = []
    allowedIndices.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize where bitmask.isTokenAllowed(tokenId) {
        allowedIndices.append(UInt32(tokenId))
    }
    guard !allowedIndices.isEmpty else {
        throw MLXLanguageModelError.grammarMismatch
    }
    let allowedArray = MLXArray(allowedIndices)
    let maskedLogits = full(logits.shape, values: -Float.infinity)
    maskedLogits[0..., allowedArray] = logits[0..., allowedArray]
    let sampledToken = sampler.sample(logits: maskedLogits)
    return sampledToken.item(Int.self)
}

Copilot AI Feb 9, 2026

The masked sampling path is O(vocabSize) per generated token (scans every token ID, builds an indices array, and allocates a full maskedLogits tensor each step). For typical vocab sizes (50k–200k) this can be a major bottleneck for structured generation. Consider reusing buffers across steps and/or applying the mask more directly (e.g., materializing only the mask once per step without reserving vocabSize, or using a bitmask→indices iterator if XGrammar exposes one) to avoid repeated full-vocab scans and allocations.
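As an illustration of the buffer-reuse idea, a sketch that replaces the per-step indices array and full-vocab tensor with one reusable `[Float]` mask added onto the logits; `maskValues` and its lifetime are assumptions, and the other names mirror the quoted code:

if applyMask {
    // maskValues: [Float] is allocated once per generation (count == vocabSize)
    // and reused on every step instead of building a fresh indices array.
    var anyAllowed = false
    for tokenId in 0 ..< vocabSize {
        let allowed = bitmask.isTokenAllowed(tokenId)
        maskValues[tokenId] = allowed ? 0 : -Float.infinity
        anyAllowed = anyAllowed || allowed
    }
    guard anyAllowed else {
        throw MLXLanguageModelError.grammarMismatch
    }
    // Adding the mask avoids materializing a full -inf tensor and
    // advanced-indexing into it on every generated token.
    let maskedLogits = logits + MLXArray(maskValues).reshaped(logits.shape)
    let sampledToken = sampler.sample(logits: maskedLogits)
    return sampledToken.item(Int.self)
}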
